Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation

Authors

  • Daniel Lemire
  • Owen Kaser
Abstract

Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent, whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass, one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve the theoretical bounds on estimation accuracy by a factor of 2 by replacing pairwise independent hashing with 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.
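As a rough illustration of the ideas in the abstract, the following Python sketch hashes n-grams recursively with a cyclic-polynomial (rotate-and-XOR) scheme, discards n − 1 bits, and feeds the result to a one-pass, one-hash distinct-count estimator. Several details are assumptions rather than the authors' choices: the 32-bit width `L`, the helper names (`make_h1`, `rolling_hashes`, `discard_bits`, `estimate_distinct`), the decision to drop the high-order bits, and the estimator itself, which is a simple adaptive-sampling scheme swapped in because the abstract does not name one.

```python
import random

L = 32                    # hash width in bits (illustrative assumption)
MASK = (1 << L) - 1

def rotl(x, r):
    """Rotate the L-bit value x left by r bits."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK

def make_h1(seed=0):
    """Return a function giving each symbol a random L-bit value, fixed on first use."""
    rng = random.Random(seed)
    table = {}
    def h1(symbol):
        if symbol not in table:
            table[symbol] = rng.getrandbits(L)
        return table[symbol]
    return h1

def direct_hash(gram, h1):
    """Hash one n-gram from scratch: rotate by one bit, then XOR in each symbol."""
    h = 0
    for s in gram:
        h = rotl(h, 1) ^ h1(s)
    return h

def rolling_hashes(seq, n, h1):
    """Yield the hash of every n-gram in seq, updating the previous value each step."""
    if len(seq) < n:
        return
    h = direct_hash(seq[:n], h1)
    yield h
    for i in range(n, len(seq)):
        # Rotate, cancel the symbol leaving the window, add the symbol entering it.
        h = rotl(h, 1) ^ rotl(h1(seq[i - n]), n) ^ h1(seq[i])
        yield h

def discard_bits(h, n):
    """Drop n - 1 bits (the high-order ones here, an assumption); requires n <= L."""
    return h & ((1 << (L - n + 1)) - 1)

def estimate_distinct(hashes, capacity=256):
    """One pass, one hash: keep values whose t low bits are zero, raising t when full."""
    t = 0
    sample = set()
    for h in hashes:
        if (h & ((1 << t) - 1)) == 0:
            sample.add(h)
            while len(sample) > capacity:
                t += 1
                low = (1 << t) - 1
                sample = {v for v in sample if (v & low) == 0}
    return len(sample) * (1 << t)

if __name__ == "__main__":
    h1 = make_h1(seed=1)
    text = "the quick brown fox jumps over the lazy dog the quick brown fox"
    n = 3
    reduced = (discard_bits(h, n) for h in rolling_hashes(text, n, h1))
    print(estimate_distinct(reduced, capacity=256))
```

Each update touches only the symbol leaving the window and the symbol entering it, so the cost per n-gram does not depend on n. While the number of distinct hash values stays below `capacity`, the estimator simply counts them; beyond that it keeps roughly one value in 2^t and scales the sample size back up.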


Similar resources

One-Pass, One-Hash n-Gram Statistics Estimation

In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashin...


Recursive n-gram hashing is pairwise independent, at best

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time ...


Compressed Image Hashing using Minimum Magnitude CSLBP

Image hashing tolerates compression, enhancement, and other signal processing operations on digital images, which are usually acceptable manipulations, whereas cryptographic hash functions are very sensitive to even single-bit changes in an image. An image hash summarizes important quality features in quantized form. In this paper, we proposed a novel image hashing algorithm for authentication which i...


Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutat...


Closed hashing is computable and optimally randomizable with universal hash functions

Universal hash functions that exhibit (c log n)-wise independence are shown to give a performance in double hashing, uniform hashing and virtually any reasonable generalization of double hashing that has an expected probe count of 1/(1 − α) + O(1/n) for the insertion of the αn-th item into a table of size n, for any fixed α < 1. This performance is optimal. These results are derived from a novel formulation ...



Journal:
  • CoRR

Volume: abs/0705.4676

Publication date: 2007